Clinical Corpus Annotation: Challenges and Strategies

Authors

Fei Xia

Meliha Yetisgen-Yildiz

Abstract

Annotation is an important task for Natural Language Processing (NLP), and the traditional annotation schema, which includes writing detailed guidelines and training annotators, has proved to work well in many previous annotation projects. However, making medical judgments on clinical data requires medical expertise, so annotation can only be done by experts. Recently, we created three corpora for our clinical NLP studies: one marks critical recommendations in radiology reports, and the other two indicate whether a patient has pneumonia based on chest X-ray reports or ICU reports. All the annotations were done by medical experts. In this paper, we discuss various challenges we encountered when dealing with expert annotation, and lay out some lessons we learned from the annotation tasks. Our experiments show that medical training alone is not sufficient for achieving high inter-annotator agreement, and that NLP researchers should get involved in the annotation process as early as possible despite their lack of medical training.

Related articles

Languages under the influence: Building a database of Uralic languages

For most of the Uralic languages, there is a lack of systematically collected, consequently transcribed and morphologically annotated text corpora. This paper sums up the steps, the preliminary results and the future directions of building a linguistic corpus of some Uralic languages, namely Tundra Nenets, Udmurt, Synya Khanty, and Surgut Khanty. The experiences of building a corpus containing ...

Are We There Yet?: The Development of a Corpus Annotated for Social Acts in Multilingual Online Discourse

We present the AAWD and AACD corpora, a collection of discussions drawn from Wikipedia talk pages and small group IRC discussions in English, Russian and Mandarin. Our datasets are annotated with labels capturing two kinds of social acts: alignment moves and authority claims. We describe these social acts, discuss our annotation process, highlight challenges we encountered and strategies we emp...

Challenges in the development of annotated corpora of computer-mediated communication in Indian Languages: A Case of Hindi

The present paper describes an ongoing effort to compile and annotate a large corpus of computer-mediated communication (CMC) in Hindi. It describes the process of the compilation of the corpus, the basic structure of the corpus and the annotation of the corpus and the challenges faced in the creation of such a corpus. It also gives a description of the technologies developed for the processing...

Challenges in Automating Maze Detection

SALT is a widely used annotation approach for analyzing natural language transcripts of children. Nine annotated corpora are distributed along with scoring software to provide norming data. We explore automatic identification of mazes – SALT’s version of disfluency annotations – and find that cross-corpus generalization is very poor. This surprising lack of cross-corpus generalization suggests s...

A Methodology for Corpus Annotation through Crowdsourcing

In contrast to expert-based annotation, for which elaborate methodologies ensure high quality output, currently no systematic guidelines exist for crowdsourcing annotated corpora, despite the increasing popularity of this approach. To address this gap, we define a crowd-based annotation methodology, compare it against the OntoNotes methodology for expert-based annotation, and identify future ch...

Publication date: 2012
